On loade la matrice de reads par échantillon et par gènes obtenue par htseq-count. On supprime les dernières lignes qui correspondent aux reads non identifiés comme des gènes ou comme ambigus.
data = read.table("quantif_all.txt", h = T)
rownames(data) = data$Gene
data = dplyr::select(data, -c("Gene", "WT.K1"))
genes = which(!(grepl("__", rownames(data))))
data = data[genes, ]
data = data[, sort(colnames(data))]
kable(head(data))| WT.K2 | WT.K3 | WT.N1 | WT.N2 | WT.N3 | |
|---|---|---|---|---|---|
| AT1G01010.1 | 5 | 2 | 3 | 6 | 3 |
| AT1G01020.1 | 0 | 0 | 0 | 1 | 0 |
| AT1G01020.2 | 0 | 0 | 0 | 0 | 0 |
| AT1G01030.1 | 4 | 2 | 0 | 0 | 0 |
| AT1G01040.1 | 15 | 8 | 18 | 14 | 14 |
| AT1G01040.2 | 0 | 0 | 0 | 0 | 0 |
On rensigne le design et le type d’échantillons.
design = data.frame(row.names = colnames(data), condition = c("untreated", "untreated",
"treated", "treated", "treated"), libType = rep("single-end", 5))
kable(design)| condition | libType | |
|---|---|---|
| WT.K2 | untreated | single-end |
| WT.K3 | untreated | single-end |
| WT.N1 | treated | single-end |
| WT.N2 | treated | single-end |
| WT.N3 | treated | single-end |
D’après les 2 papiers de review lus (et détails techniques sur https://hbctraining.github.io/DGE_workshop/lessons/02_DGE_count_normalization.html) :
RPKM non adaptés pour la DE : Permettent la comparaison entre gènes d’un même échantillon en corrigeant la profondeur de séquençage et la taille des gènes, mais ne permet pas de comparer des échantillons différents. (La somme des counts normalisés en RPKM ne sera pas la même pour deux échantillons). De plus, on compare des gènes identiques uniquement, donc la taille n’est pas importante.
DESeq : la médiane de ratios : On calcule un sizeFactor propre à chaque échantillon pour corriger la profondeur de séquençage. On divise chaque count par la moyenne géométrique des counts de son gène parmis tous les échantillons (Comme si on créait un nouvel échantillon de pseudo-référence. Les ratios dans un même échantillon devraient être similaires entre la plupart des gènes, qui sont en majorité non differentiellement exprimés). Le size factor d’un échantillon est la médiane de ces ratios pour tous les gènes. Pour normaliser, on divise tous les counts d’un échantillon par son sizeFactor. (On ne considère pas dans la médiane les gènes ayant une moyenne géométrique de 0)
TMM :
à voir avec EdgeR.
cds = estimateSizeFactors(cds)
keep <- rowSums(counts(cds)) >= 10
cds <- cds[keep, ]
sizeFactors(cds) WT.K2 WT.K3 WT.N1 WT.N2 WT.N3
1.2619801 0.4435494 1.2221518 1.1584047 1.4073152
par(mfrow = c(1, 2))
heatmap(counts(cds)[sample(rownames(counts(cds, normalized = FALSE)), 500),
], main = "Sans normalisation")heatmap(counts(cds, normalized = TRUE)[sample(rownames(counts(cds, normalized = TRUE)),
500), ], main = "Avec normalisation")The variance seen between counts is the sum of two components: the sample-to-sample variation just mentioned, and the uncertaintyin measuring a concentration by counting reads. The latter, known as shot noise or Poisson noise, is the dominatingnoise source for lowly expressed genes. The former dominates for highly expressed genes. The sum of both, shot noiseand dispersion, is considered in the differential expression inference.Hence, the variancevof count values is modelled as \(v = s\mu + \alpha s^2\mu^2\),
where \(\mu\) is the expected normalized count value (estimated by the average normalized count value),\(s\) is the size factor for the sample under consideration, and \(\alpha\) is the dispersion value for the gene under consideration.
# Analyse d’expression différentielle
On continue maintenant avec DESeq2, qui utilise le même test statistique que DESeq mais avec des améliorations pour les estimations et le GLM. Test : maintenant que les variances ont été estimées, il devient facile de comparer les conditions avec un test, rbinom.
A decrire mieux et parler des modèles, tout ça tout ça.
On construit le dataset, puis on enlève les lignes dont la somme des reads ne dépasse pas 10.
dds <- DESeq2::DESeqDataSetFromMatrix(countData = data, colData = design, design = ~condition)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
dds$condition <- relevel(dds$condition, ref = "untreated")
dds <- DESeq2::DESeq(dds)
res <- results(dds)
resLFC <- lfcShrink(dds, coef = "condition_treated_vs_untreated", type = "apeglm")
resOrdered <- res[order(res$padj), ]
sum(res$padj < 0.05, na.rm = TRUE)[1] 88
On a peu de gènes DE, seulement 48 avec un seuil à 5% de p-value, et 60 à 10%.
Visu du gènes le plus différentiellement exprimé :
d <- plotCounts(dds, gene = which.min(res$padj), intgroup = "condition", returnData = TRUE)
ggplot(d, aes(x = condition, y = count)) + geom_point(position = position_jitter(w = 0.1,
h = 0)) + scale_y_log10(breaks = c(25, 100, 400))On essaie de voir ce qui pourrait être étrange dans les données pour avoir si peu de gènes :
[1] 1240
En prenant une condition sur le log2FoldChange on a beaucoup plus de gènes DE. Affichage en heatmap des gènes les plus différentiellement exprimés.
par(mfrow = c(1, 2))
thr = 0.05
res = na.omit(res)
DEgenes = rownames(res[abs(res$log2FoldChange) > 1, ])
mat = counts(dds, normalized = TRUE)[DEgenes, ]
heatmap(mat, main = "abs(Log2FoldChange) > 1 ")DEgenes = rownames(res[res$padj < thr, ])
mat = counts(dds, normalized = TRUE)[DEgenes, ]
heatmap(mat, main = "Significatifs")DEgenes_pval = rownames(res[res$padj < thr, ])
DEgenes_upreg = rownames(res[res$log2FoldChange > 1, ])
DEgenes = intersect(DEgenes_pval, DEgenes_upreg)
mat = counts(dds, normalized = TRUE)[DEgenes, ]
heatmap(mat, main = "Juste les significatifs uprégulés")On s’intéresse maintenant à la liste de gènes sélectionnés et à leur ontologie.
mart = useMart(biomart = "plants_mart", host = "plants.ensembl.org", dataset = "athaliana_eg_gene")
results <- getBM(filters = "ensembl_transcript_id", attributes = c("ensembl_transcript_id",
"description", "external_gene_name"), values = DEgenes, mart = mart)
rownames(results) = results$ensembl_transcript_id
r = results[DEgenes, ]
kable(r, caption = "Genes DE et leur GO")| ensembl_transcript_id | description | external_gene_name | |
|---|---|---|---|
| AT1G06000.1 | AT1G06000.1 | Glycosyltransferase (Fragment) [Source:UniProtKB/TrEMBL;Acc:W8PVE7] | UGT89C1 |
| AT1G07520.1 | AT1G07520.1 | GRAS family transcription factor [Source:TAIR;Acc:AT1G07520] | |
| AT1G08090.1 | AT1G08090.1 | High-affinity nitrate transporter 2.1 [Source:UniProtKB/Swiss-Prot;Acc:O82811] | NRT2.1 |
| AT1G08592.1 | AT1G08592.1 | Potential natural antisense gene, locus overlaps with AT1G08590 [Source:TAIR;Acc:AT1G08592] | |
| AT1G15550.1 | AT1G15550.1 | Gibberellin 3-beta-dioxygenase 1 [Source:UniProtKB/Swiss-Prot;Acc:Q39103] | GA3OX1 |
| AT1G21310.1 | AT1G21310.1 | Extensin-3 [Source:UniProtKB/Swiss-Prot;Acc:Q9FS16] | EXT3 |
| AT1G24290.1 | AT1G24290.1 | AAA-type ATPase family protein [Source:UniProtKB/TrEMBL;Acc:O48696] | |
| AT1G25550.1 | AT1G25550.1 | Transcription factor HHO3 [Source:UniProtKB/Swiss-Prot;Acc:Q9FPE8] | HHO3 |
| AT1G26250.1 | AT1G26250.1 | Proline-rich extensin-like family protein [Source:UniProtKB/TrEMBL;Acc:Q9C669] | |
| AT1G29910.1 | AT1G29910.1 | Chlorophyll a-b binding protein 3, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q8VZ87] | LHCB1.1 |
| AT1G29930.1 | AT1G29930.1 | Chlorophyll a-b binding protein 1, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:P04778] | LHCB1.3 |
| AT1G35560.1 | AT1G35560.1 | Transcription factor TCP23 [Source:UniProtKB/Swiss-Prot;Acc:Q9LQF0] | TCP23 |
| AT1G37130.1 | AT1G37130.1 | Nitrate reductase [NADH] 2 [Source:UniProtKB/Swiss-Prot;Acc:P11035] | NIA2 |
| AT1G43710.1 | AT1G43710.1 | SDC1 [Source:UniProtKB/TrEMBL;Acc:A0A178WA97] | SDC |
| AT1G50750.1 | AT1G50750.1 | Plant mobile domain protein family [Source:TAIR;Acc:AT1G50750] | |
| AT1G55580.1 | AT1G55580.1 | SCL18 [Source:UniProtKB/TrEMBL;Acc:A0A178WEV4] | SCL18 |
| AT1G55920.1 | AT1G55920.1 | Serine acetyltransferase 1, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q42588] | SAT1 |
| AT1G55990.1 | AT1G55990.1 | Glycine-rich protein [Source:UniProtKB/TrEMBL;Acc:F4I3I2] | |
| AT1G64185.1 | AT1G64185.1 | At1g64185 [Source:UniProtKB/TrEMBL;Acc:Q8LGF9] | |
| AT1G64190.1 | AT1G64190.1 | 6-phosphogluconate dehydrogenase, decarboxylating 1, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q9SH69] | PGD1 |
| AT1G64720.1 | AT1G64720.1 | Polyketide cyclase/dehydrase and lipid transport superfamily protein [Source:UniProtKB/TrEMBL;Acc:Q9XIR9] | CP5 |
| AT1G66180.1 | AT1G66180.1 | Eukaryotic aspartyl protease family protein [Source:UniProtKB/TrEMBL;Acc:Q9C8C9] | |
| AT1G68670.1 | AT1G68670.1 | Transcription factor HHO2 [Source:UniProtKB/Swiss-Prot;Acc:Q8VZS3] | HHO2 |
| AT1G72450.1 | AT1G72450.1 | TIFY11B [Source:UniProtKB/TrEMBL;Acc:A0A178W2J3] | TIFY11B |
| AT1G77760.1 | AT1G77760.1 | Nitrate reductase [Source:UniProtKB/TrEMBL;Acc:A0A178WBR8] | NIA1 |
| AT1G78050.1 | AT1G78050.1 | Phosphoglycerate/bisphosphoglycerate mutase [Source:UniProtKB/TrEMBL;Acc:F4I8M8] | PGM |
| AT1G78060.1 | AT1G78060.1 | Probable beta-D-xylosidase 7 [Source:UniProtKB/Swiss-Prot;Acc:Q9SGZ5] | BXL7 |
| AT1G78090.1 | AT1G78090.1 | Trehalose-phosphate phosphatase B [Source:UniProtKB/Swiss-Prot;Acc:Q9C9S4] | TPPB |
| AT1G80440.1 | AT1G80440.1 | F-box/kelch-repeat protein At1g80440 [Source:UniProtKB/Swiss-Prot;Acc:Q9M8L2] | |
| AT2G15620.1 | AT2G15620.1 | Ferredoxin–nitrite reductase, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q39161] | NIR1 |
| AT2G17820.1 | AT2G17820.1 | Histidine kinase 1 [Source:UniProtKB/Swiss-Prot;Acc:Q9SXL4] | AHK1 |
| AT2G22500.1 | AT2G22500.1 | Mitochondrial uncoupling protein 5 [Source:UniProtKB/Swiss-Prot;Acc:Q9SJY5] | PUMP5 |
| AT2G24540.1 | AT2G24540.1 | F-box protein AFR [Source:UniProtKB/Swiss-Prot;Acc:Q8LAW2] | AFR |
| AT2G27510.1 | AT2G27510.1 | Ferredoxin [Source:UniProtKB/TrEMBL;Acc:A0A178VWS0] | FD3 |
| AT2G30040.1 | AT2G30040.1 | Mitogen-activated protein kinase kinase kinase 14 [Source:UniProtKB/TrEMBL;Acc:O64741] | MAPKKK14 |
| AT2G30615.1 | AT2G30615.1 | F-box/LRR protein [Source:UniProtKB/TrEMBL;Acc:Q58G02] | |
| AT2G34080.1 | AT2G34080.1 | Cysteine proteinases superfamily protein [Source:UniProtKB/TrEMBL;Acc:O22961] | |
| AT2G34420.1 | AT2G34420.1 | Chlorophyll a-b binding protein, chloroplastic [Source:UniProtKB/TrEMBL;Acc:Q39141] | LHB1B2 |
| AT2G35637.1 | AT2G35637.1 | other RNA [Source:TAIR;Acc:AT2G35637] | |
| AT2G35640.1 | AT2G35640.1 | Homeodomain-like superfamily protein [Source:UniProtKB/TrEMBL;Acc:Q9ZQN7] | |
| AT2G36470.1 | AT2G36470.1 | DUF868 family protein, putative (DUF868) [Source:UniProtKB/TrEMBL;Acc:Q9SJQ8] | |
| AT2G36570.1 | AT2G36570.1 | Leucine-rich repeat receptor-like protein kinase PXC1 [Source:UniProtKB/Swiss-Prot;Acc:Q9SJQ1] | PXC1 |
| AT2G36580.1 | AT2G36580.1 | Pyruvate kinase [Source:UniProtKB/TrEMBL;Acc:A0A178VX82] | |
| AT2G36590.1 | AT2G36590.1 | Proline transporter 3 [Source:UniProtKB/Swiss-Prot;Acc:Q9SJP9] | PROT3 |
| AT2G38640.1 | AT2G38640.1 | Protein LURP-one-related 8 [Source:UniProtKB/Swiss-Prot;Acc:Q9ZVI6] | |
| AT2G39210.1 | AT2G39210.1 | At2g39210/T16B24.15 [Source:UniProtKB/TrEMBL;Acc:O80960] | |
| AT2G41440.1 | AT2G41440.1 | unknown protein; BEST Arabidopsis thaliana protein match is: unknown protein (TAIR:AT2G41470.1); Ha. [Source:TAIR;Acc:AT2G41440] | |
| AT2G45685.1 | AT2G45685.1 | other RNA [Source:TAIR;Acc:AT2G45685] | |
| AT3G02910.1 | AT3G02910.1 | Putative gamma-glutamylcyclotransferase At3g02910 [Source:UniProtKB/Swiss-Prot;Acc:Q9M8T3] | |
| AT3G02920.2 | AT3G02920.2 | Replication protein A 32 kDa subunit B [Source:UniProtKB/Swiss-Prot;Acc:Q8LFJ8] | RPA2B |
| AT3G07350.1 | AT3G07350.1 | F21O3.6 protein [Source:UniProtKB/TrEMBL;Acc:Q9SRT1] | |
| AT3G16240.1 | AT3G16240.1 | Aquaporin TIP2-1 [Source:UniProtKB/Swiss-Prot;Acc:Q41951] | TIP2-1 |
| AT3G19430.1 | AT3G19430.1 | Late embryogenesis abundant protein-related / LEA protein-like protein [Source:UniProtKB/TrEMBL;Acc:A0A178VGJ3] | |
| AT3G24520.1 | AT3G24520.1 | Heat stress transcription factor C-1 [Source:UniProtKB/Swiss-Prot;Acc:Q9LV52] | HSFC1 |
| AT3G27180.1 | AT3G27180.1 | At3g27180 [Source:UniProtKB/TrEMBL;Acc:B3DNP3] | |
| AT3G28550.1 | AT3G28550.1 | Proline-rich extensin-like family protein [Source:UniProtKB/TrEMBL;Acc:F4J0B5] | |
| AT3G47520.1 | AT3G47520.1 | Malate dehydrogenase, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q9SN86] | MDH |
| AT3G47980.1 | AT3G47980.1 | At3g47980 [Source:UniProtKB/TrEMBL;Acc:Q84MC3] | |
| AT3G48360.1 | AT3G48360.1 | BTB/POZ and TAZ domain-containing protein 2 [Source:UniProtKB/Swiss-Prot;Acc:Q94BN0] | BT2 |
| AT3G49940.1 | AT3G49940.1 | LOB domain-containing protein 38 [Source:UniProtKB/Swiss-Prot;Acc:Q9SN23] | LBD38 |
| AT3G61580.1 | AT3G61580.1 | Delta(8)-fatty-acid desaturase 1 [Source:UniProtKB/Swiss-Prot;Acc:Q9ZRP7] | SLD1 |
| AT4G13918.1 | AT4G13918.1 | Potential natural antisense gene, locus overlaps with AT4G13920 [Source:TAIR;Acc:AT4G13918] | |
| AT4G25835.1 | AT4G25835.1 | AAA-ATPase At4g25835 [Source:UniProtKB/Swiss-Prot;Acc:Q8RY66] | |
| AT4G29700.1 | AT4G29700.1 | Alkaline-phosphatase-like family protein [Source:UniProtKB/TrEMBL;Acc:Q9SU81] | |
| AT4G31910.1 | AT4G31910.1 | Brassinosteroid-related acyltransferase 1 [Source:UniProtKB/Swiss-Prot;Acc:Q9SZ58] | BAT1 |
| AT4G32950.1 | AT4G32950.1 | Probable protein phosphatase 2C 61 [Source:UniProtKB/Swiss-Prot;Acc:O82637] | |
| AT4G36850.1 | AT4G36850.1 | PQ-loop repeat family protein / transmembrane family protein [Source:UniProtKB/TrEMBL;Acc:Q94AH7] | |
| AT4G37540.1 | AT4G37540.1 | LOB domain-containing protein 39 [Source:UniProtKB/Swiss-Prot;Acc:Q9SZE8] | LBD39 |
| AT4G37608.1 | AT4G37608.1 | Uncharacterized protein AT4g37600 [Source:UniProtKB/TrEMBL;Acc:Q9SZF3] | |
| NA | NA | NA | NA |
| AT4G38340.1 | AT4G38340.1 | Protein NLP3 [Source:UniProtKB/Swiss-Prot;Acc:Q9SVF1] | NLP3 |
| AT5G01215.1 | AT5G01215.1 | other RNA [Source:TAIR;Acc:AT5G01215] | |
| AT5G01215.2 | AT5G01215.2 | other RNA [Source:TAIR;Acc:AT5G01215] | |
| AT5G01542.1 | AT5G01542.1 | Potential natural antisense gene, locus overlaps with AT5G01540 [Source:TAIR;Acc:AT5G01542] | |
| AT5G04840.1 | AT5G04840.1 | At5g04840 [Source:UniProtKB/TrEMBL;Acc:Q8L5X9] | |
| AT5G09800.1 | AT5G09800.1 | U-box domain-containing protein 28 [Source:UniProtKB/Swiss-Prot;Acc:Q9LXE3] | PUB28 |
| AT5G13110.1 | AT5G13110.1 | Glucose-6-phosphate 1-dehydrogenase 2, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q9FY99] | G6PD2 |
| AT5G13420.1 | AT5G13420.1 | Aldolase-type TIM barrel family protein [Source:UniProtKB/TrEMBL;Acc:Q9LYR4] | |
| AT5G14760.1 | AT5G14760.1 | L-aspartate oxidase, chloroplastic [Source:UniProtKB/Swiss-Prot;Acc:Q94AY1] | AO |
| AT5G20540.1 | AT5G20540.1 | Protein Brevis radix-like 4 [Source:UniProtKB/Swiss-Prot;Acc:Q8GZ92] | BRXL4 |
| AT5G25460.1 | AT5G25460.1 | At5g25460 [Source:UniProtKB/TrEMBL;Acc:Q94F20] | |
| AT5G47220.1 | AT5G47220.1 | ERF2 [Source:UniProtKB/TrEMBL;Acc:A0A178UCP3] | ERF2 |
| AT5G49480.1 | AT5G49480.1 | Calcium-binding protein CP1 [Source:UniProtKB/Swiss-Prot;Acc:Q9FDX6] | ATCP1 |
| AT5G51830.1 | AT5G51830.1 | Probable fructokinase-7 [Source:UniProtKB/Swiss-Prot;Acc:Q9FLH8] | |
| AT5G54170.1 | AT5G54170.1 | Polyketide cyclase/dehydrase and lipid transport superfamily protein [Source:UniProtKB/TrEMBL;Acc:F4JYV3] | |
| AT5G57770.1 | AT5G57770.1 | Plant protein of unknown function (DUF828) with plant pleckstrin homology-like region [Source:TAIR;Acc:AT5G57770] | |
| AT5G63160.1 | AT5G63160.1 | BTB/POZ and TAZ domain-containing protein 1 [Source:UniProtKB/Swiss-Prot;Acc:Q9FMK7] | BT1 |
| AT5G67420.1 | AT5G67420.1 | LOB domain-containing protein 37 [Source:UniProtKB/Swiss-Prot;Acc:Q9FN11] | LBD37 |